The next five articles cover several data visualization tools and the different types of graphs they can produce.
# Import the required packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
A run-sequence plot is a graph that displays observed data in a time sequence.
# Create some random data
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()  # cumulative sum
df = pd.DataFrame(np.random.randn(1000, 4),
index=ts.index, columns=list('ABCD'))
df = df.cumsum()
plt.figure();  # create a new figure window
df.plot();  # draw the plot
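As a small reproducible variant of the run-sequence plot above, the sketch below seeds the random generator and draws on an explicitly created axes object. The seed value, the variable names `noise` and `walk`, and the axis labels are assumptions chosen for illustration, not part of the original code.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Fixed seed (an assumption) so the same curve is drawn on every run
rng = np.random.default_rng(0)
noise = rng.standard_normal(1000)
dates = pd.date_range("2000-01-01", periods=1000)

# The cumulative sum turns independent noise into a random walk
walk = pd.Series(noise, index=dates).cumsum()

fig, ax = plt.subplots()
walk.plot(ax=ax)  # draw on the axes created above, in time order
ax.set_xlabel("date")
ax.set_ylabel("cumulative sum")
```

Passing `ax=ax` keeps the plot on one known axes, which makes it easier to add labels or save the figure afterwards.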
A bar plot is a graph that presents categorical data with rectangular bars.
df.iloc[5].plot(kind='bar');  # the kind parameter selects the plot type
df2 = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])
# Use barh to draw a horizontal bar plot
df2.plot.barh(stacked=True);  # stacked=True stacks the columns within each bar
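The examples above draw bars from random numbers. To connect them back to categorical data, here is a minimal sketch that counts how often each category occurs and plots the counts. The `fruits` data is made up for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical categorical observations, invented for this example
fruits = pd.Series(["apple", "banana", "apple", "cherry", "banana", "apple"])

# value_counts() returns one count per category, largest first
counts = fruits.value_counts()

fig, ax = plt.subplots()
counts.plot.bar(ax=ax, rot=0)  # rot=0 keeps the category labels horizontal
ax.set_ylabel("count")
```

Each bar corresponds to one category, and its height is the number of times that category appears.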
A histogram is often used to display the distribution of continuous numerical data.
df4 = pd.DataFrame({'a': np.random.randn(1000) + 1, 'b': np.random.randn(1000),
'c': np.random.randn(1000) - 1}, columns=['a', 'b', 'c'])
df4.plot.hist(alpha=0.5, bins=20, orientation='horizontal')  # orientation sets the direction of the bars
df.diff().hist(color='k', alpha=0.5, bins=50)
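To see what a histogram actually computes, the sketch below bins the data with `np.histogram` and compares the result with what matplotlib draws. The seed and the variable names are assumptions for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
values = rng.standard_normal(1000)

# Split the range [min, max] into 20 equal-width bins and count the values in each
counts, edges = np.histogram(values, bins=20)

fig, ax = plt.subplots()
# ax.hist returns the bar heights, the bin edges, and the drawn bars
n, bins, patches = ax.hist(values, bins=20, alpha=0.5)
```

Twenty bins need twenty-one edges, and because the default range spans the minimum to the maximum, every value lands in some bin.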
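A box plot reveals skewness and outliers at a glance. It shows the median (the line inside the box), the quartiles (the box spans the 25th to the 75th percentile), and whiskers that reach toward the minimum and maximum values; with matplotlib's default settings, points farther than 1.5 times the interquartile range from the box are drawn individually as outliers.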
df = pd.DataFrame(np.random.rand(10, 5), columns=['A', 'B', 'C', 'D', 'E'])
# Specify the color of each part of the plot
color = {'boxes': 'DarkGreen', 'whiskers': 'DarkOrange',
'medians': 'DarkBlue', 'caps': 'Gray'}
df.plot.box(color=color, sym='r+')  # sym sets the marker style used for outliers
df.plot.box(vert=False, positions=[1, 4, 5, 6, 8])  # vert=False draws the boxes horizontally
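The numbers behind a box plot can be computed by hand. The following sketch does so for a small made-up sample, assuming matplotlib's default whisker rule of 1.5 times the interquartile range (`whis=1.5`). The data and variable names are hypothetical.

```python
import numpy as np

# Hypothetical sample with one unusually large value
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

# The box spans q1 to q3, and the line inside it marks the median
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1  # interquartile range, the height of the box

# Points beyond these fences are drawn as individual outlier markers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]

# The whiskers end at the most extreme points that are still inside the fences
inside = data[(data >= lower_fence) & (data <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()
```

Here only the value 30 falls outside the fences, so the upper whisker stops at 9 rather than at the maximum.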
A scatter plot is a type of plot using Cartesian coordinates to display values for typically two variables for a set of data.
df = pd.DataFrame(np.random.rand(50, 4), columns=['a', 'b', 'c', 'd'])
df.plot.scatter(x='a', y='b');
df.plot.scatter(x='a', y='b', c='c', s=50);
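In the second call above, `c='c'` colors each point by the value in column `c`, and `s=50` gives every point the same marker size. As a sketch of taking this one step further, the size can also vary per point. The seed, the `points` name, the `viridis` colormap, and the scaling factor of 200 are assumptions chosen for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
points = pd.DataFrame(rng.random((50, 4)), columns=["a", "b", "c", "d"])

fig, ax = plt.subplots()
points.plot.scatter(
    x="a",
    y="b",
    c="c",                  # color each point by column c
    colormap="viridis",
    s=points["d"] * 200,    # size each point by column d, scaled up to be visible
    ax=ax,
)
```

This way a single scatter plot shows four variables at once: two as position, one as color, and one as size.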
The code for this article is available on GitHub.
Please let me know if there are any mistakes in this article. Thanks for reading.